Preliminaries

activity-3.Rmd

  • Create activity-3.Rmd in your in-class directory
  • Use output: github_document
  • Insert in R chunk to load tidyverse and babynames

FYI: caching

The output of r chunks can be cached (saved) so that they don't need to be re-run everytime you re-knit.

#```{r compute-stuff, cache = TRUE}
# my crazy intensive computing step
#```
  • You may want to add cached files to your .gitignore.
  • Labeling chunks can help with debugging
  • Many more chunk options are available (search: knitr chunk options)

Data Wrangling

Two paradigms

R

cyl4s <- mtcars[mtcars$cyl == 4, ]
mean(cyl4s$mpg)
## [1] 26.66364

R with dplyr

library(dplyr)
cyl4s <- filter(mtcars, cyl == 4)
summarize(cyl4s, mean = mean(mpg))
##       mean
## 1 26.66364

R Subsetting

cyl4s <- mtcars[mtcars$cyl == 4, ]
  • Square brackets are used to subset dataframes.
  • [rows, columns]

R Summaries

mean(cyl4s$mpg)
## [1] 26.66364
  • Many commands operate on a vector (column) of data extracted from a dataframe using the $.

dplyr Subsetting

cyl4s <- filter(mtcars, cyl == 4)
  • dplyr contains separate functions to filter the rows and select the columns.
  • The dataframe is one argument and the column names are another (no $).
cyl4s <- mtcars %>% filter(cyl == 4)
  • Can also chain commands using the pipe.

dplyr Summaries

summarize(cyl4s, mean = mean(mpg))
##       mean
## 1 26.66364
  • Any numerical summary that you want to apply to a column of a dataframe is specified within summarize().
cyl4s %>% summarize(mean = mean(mpg))
##       mean
## 1 26.66364
  • This, too, can be put into a chain.

Two paradigms

R

cyl4s <- mtcars[mtcars$cyl == 4, ]
mean(cyl4s$mpg)
## [1] 26.66364

R with dplyr

library(dplyr)
cyl4s <- filter(mtcars, cyl == 4)
summarize(cyl4s, mean = mean(mpg))
##       mean
## 1 26.66364

Why dplyr?

Data sets are often of high volume (lots of rows) and high variety (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also often need to perform operations to organize and clean our data.

This is all possible in base R, but with dplyr, it is simple, readible, and fast.

The Five Verbs

  • select()
  • filter()
  • mutate()
  • arrange()
  • summarize()

Philosophy

  • Each verb takes a data frame and returns a data frame
    • actually a tbl_df
    • allows chaining with %>%
  • Idea:
    • master a few simple commands
    • use your creativity to combine them
  • Cheat sheet:

select()

Subset the columns.

filter()

Subset the rows.

mutate()

Add or modify a column.

rename()

Rename a column.

arrange()

Sort the rows.

summarize()

Summarize column(s) into a single row.

Activity 3.1

  • Add a new column to babynames that indicates whether or not the name proportion in that yearxgender was greater than 1%. Call that column is_popular.
  • Which are the least popular "popular" names?

Activity 3.1

  • Add a new column to babynames that indicates whether or not the name proportion in that yearxgender was greater than 1%. Call that column is_popular.
  • Which are the least popular "popular" names?
babynames %>%
  mutate(is_popular = prop > .01) %>%
  filter(is_popular == TRUE) %>%
  arrange(prop)
## # A tibble: 3,875 x 6
##     year sex   name         n   prop is_popular
##    <dbl> <chr> <chr>    <int>  <dbl> <lgl>     
##  1  1894 F     Grace     2361 0.0100 T         
##  2  1969 F     Karen    17637 0.0100 T         
##  3  1983 F     Crystal  17904 0.0100 T         
##  4  2000 F     Madison  19967 0.0100 T         
##  5  1960 M     Gary     21688 0.0100 T         
##  6  1911 F     Florence  4424 0.0100 T         
##  7  1967 M     Eric     17828 0.0100 T         
##  8  1957 M     Larry    21913 0.0100 T         
##  9  2003 M     Daniel   21032 0.0100 T         
## 10  1934 M     Frank    10635 0.0100 T         
## # ... with 3,865 more rows

Activity 3.2

  • Create a new object called bella that contains all instances of babies named "Bella" and all of the variables except prop.
  • What class() is this object?
  • Create a plot of the popularity of Bella over time.

Activity 3.2

  • Create a new object called bella that contains all instances of babies named "Bella" and all of the variables except prop.
  • What class() is this object?
  • Create a plot of the popularity of Bella over time.
bella <- babynames %>%
  select(year, sex, name, n) %>%
  filter(name == "Bella")

or

bella <- babynames %>%
  select(-prop) %>%
  filter(name == "Bella")
class(bella)
## [1] "tbl_df"     "tbl"        "data.frame"

babynames %>%
  select(year, sex, name, n) %>%
  filter(name == "Bella") %>%
  ggplot(aes(x = year, y = n)) + 
  geom_line(aes(color = sex))

Composition

  • Pipe (%>%) is provided by the magrittr package
  • Inspired by pipe (|) in UNIX

Using the pipe

The expression

mydata %>%
  verb(arg1)

is the same as:

verb(mydata, arg1)
  • It takes the output of the leading function and drops it in as the first argument of the trailing function.
  • You can indicate the landing site with .

Why the pipe?

Instead of having to read/write:

select(filter(mutate(data, arg1), arg2), arg3)

You can do:

data %>%
  mutate(arg1) %>%
  filter(arg2) %>%
  select(arg3)

Coding Little Bunny Foo Foo

Practice

Activity 3.3

  • Without using summarize() or graphics, find the year in which your name was used most frequently.
  • What was the most popular name that year?

Activity 3.3

  • Without using summarize() or graphics, find the year in which your name was used most frequently.
  • What was the most popular name that year?
babynames %>%
  filter(name == "Andrew") %>%
  arrange(desc(n))
## # A tibble: 245 x 5
##     year sex   name       n   prop
##    <dbl> <chr> <chr>  <int>  <dbl>
##  1  1987 M     Andrew 36199 0.0186
##  2  1988 M     Andrew 35727 0.0179
##  3  1989 M     Andrew 34809 0.0166
##  4  1986 M     Andrew 33853 0.0176
##  5  1990 M     Andrew 33653 0.0156
##  6  1991 M     Andrew 31525 0.0149
##  7  1992 M     Andrew 30534 0.0146
##  8  1985 M     Andrew 30160 0.0157
##  9  1993 M     Andrew 27314 0.0132
## 10  1994 M     Andrew 26006 0.0128
## # ... with 235 more rows

Activity 3.3

  • Without using summarize() or graphics, find the year in which your name was used most frequently.
  • What was the most popular name that year?
babynames %>%
  filter(year == 1987) %>%
  arrange(desc(n))
## # A tibble: 21,399 x 5
##     year sex   name            n   prop
##    <dbl> <chr> <chr>       <int>  <dbl>
##  1  1987 M     Michael     63638 0.0327
##  2  1987 F     Jessica     55990 0.0299
##  3  1987 F     Ashley      54845 0.0293
##  4  1987 M     Christopher 54476 0.0279
##  5  1987 M     Matthew     46473 0.0238
##  6  1987 F     Amanda      41786 0.0223
##  7  1987 M     Joshua      39991 0.0205
##  8  1987 M     David       36776 0.0189
##  9  1987 M     Andrew      36199 0.0186
## 10  1987 M     Daniel      35857 0.0184
## # ... with 21,389 more rows

Aggregation

group_by()

Indicate sub-groups of rows in a data set.

group_by()

Indicate sub-groups of rows in a data set.

Total number of names in each year

babynames %>%
  group_by(year) %>%
  summarize(count = n())
## # A tibble: 136 x 2
##     year count
##    <dbl> <int>
##  1  1880  2000
##  2  1881  1935
##  3  1882  2127
##  4  1883  2084
##  5  1884  2297
##  6  1885  2294
##  7  1886  2392
##  8  1887  2373
##  9  1888  2651
## 10  1889  2590
## # ... with 126 more rows

Activity 3.4

  • Which year had the greatest number of births?
  • In a single pipeline, compute the earliest and latest year that each name appears.

Activity 3.4

  • Which year had the greatest number of births?
  • In a single pipeline, compute the earliest and latest year that each name appears.
babynames %>%
  group_by(year) %>%
  summarize(num_births = sum(n)) %>%
  arrange(desc(num_births)) 
## # A tibble: 136 x 2
##     year num_births
##    <dbl>      <int>
##  1  1957    4200146
##  2  1959    4156617
##  3  1960    4154877
##  4  1961    4140040
##  5  1958    4131784
##  6  1956    4121274
##  7  1962    4035565
##  8  1955    4012757
##  9  2007    3992280
## 10  1954    3979147
## # ... with 126 more rows

Activity 3.4

  • Which year had the greatest number of births?
  • In a single pipeline, compute the earliest and latest year that each name appears.
babynames %>%
  group_by(name) %>%
  summarize(earliest = min(year), latest = max(year))
## # A tibble: 95,025 x 3
##    name      earliest latest
##    <chr>        <dbl>  <dbl>
##  1 Aaban         2007   2015
##  2 Aabha         2011   2015
##  3 Aabid         2003   2003
##  4 Aabriella     2008   2015
##  5 Aada          2015   2015
##  6 Aadam         1987   2015
##  7 Aadan         2003   2015
##  8 Aadarsh       2000   2015
##  9 Aaden         2001   2015
## 10 Aadesh        2005   2011
## # ... with 95,015 more rows

More . . .

Choose two of the following exercises to add to your activity and also please post a graphic to slack (the last bullet).

  • 3.5: Among popular names (let’s say at least 1% of the births in a given year), which name is the youngest—meaning that its first appearance as a popular name is the most recent?
  • 3.6: It seems like there is more diversity of names now than in the past. How have the number of names used changed over time? Has it been the same for boys and girls?
  • 3.7: Find the most popular names of the 1990s.
  • 3.8: Which names have been given to M and F most equally?
  • Before class thursday, please create a visualization of this data set and post it (and it's code) to slack.

From slack

  • Message: there is a greater diversity of names now than in the past.
  • The choice of measure matters a bit.

  • Message: Most of the 2015 popular names are relatively new. Exception: William.
  • Technique: focus on the long term trend of a subset that are remarkable in a smaller time interval.

  • Message: Biblical names are becoming rarer, though boys names got a bump in the second half of the 20th century.

  • Message: Control for factors that you're not interested in.
  • My curiosity here was seeing when there were the most people per name. It made sense to me that when there are less people there are less names, so I wanted to see at what era we essentially saw a population boom without a coinciding 'name quantity' boom.

dplyr, some notes

distinct()

Return the rows with duplicate labels.

babynames %>%
  distinct(name)
## # A tibble: 95,025 x 1
##    name     
##    <chr>    
##  1 Mary     
##  2 Anna     
##  3 Emma     
##  4 Elizabeth
##  5 Minnie   
##  6 Margaret 
##  7 Ida      
##  8 Alice    
##  9 Bertha   
## 10 Sarah    
## # ... with 95,015 more rows

So many n's

babynames %>%
  filter(name == "Jackie") %>%
  group_by(year) %>%
  summarize(n(), 
            sum(n), 
            sum(ifelse(sex == "M", n, 0)))

What will this return?

So many n's

babynames %>%
  filter(name == "Jackie") %>%
  group_by(year) %>%
  summarize(n(), 
            sum(n), 
            sum(ifelse(sex == "M", n, 0)))
## # A tibble: 117 x 4
##     year `n()` `sum(n)` `sum(ifelse(sex == "M", n, 0))`
##    <dbl> <int>    <int>                           <dbl>
##  1  1899     1        9                            0   
##  2  1900     1       12                            0   
##  3  1901     1        7                            0   
##  4  1902     1        6                            0   
##  5  1903     1       12                            0   
##  6  1904     2       25                            5.00
##  7  1905     1        9                            0   
##  8  1906     1       14                            0   
##  9  1907     1       15                            0   
## 10  1908     2       28                            5.00
## # ... with 107 more rows

Add clarity with names

babynames %>%
  filter(name == "Jackie") %>%
  rename(n_births = n) %>%
  group_by(year) %>%
  summarize(number_of_rows = n(), 
            n_total_births = sum(n_births), 
            n_boy_births = sum(ifelse(sex == "M", n_births, 0)))
## # A tibble: 117 x 4
##     year number_of_rows n_total_births n_boy_births
##    <dbl>          <int>          <int>        <dbl>
##  1  1899              1              9         0   
##  2  1900              1             12         0   
##  3  1901              1              7         0   
##  4  1902              1              6         0   
##  5  1903              1             12         0   
##  6  1904              2             25         5.00
##  7  1905              1              9         0   
##  8  1906              1             14         0   
##  9  1907              1             15         0   
## 10  1908              2             28         5.00
## # ... with 107 more rows

Joins

Relational data

Consider two tables with rows identified by their key.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
x
## # A tibble: 3 x 2
##     key val_x
##   <dbl> <chr>
## 1  1.00 x1   
## 2  2.00 x2   
## 3  3.00 x3
y
## # A tibble: 3 x 2
##     key val_y
##   <dbl> <chr>
## 1  1.00 y1   
## 2  2.00 y2   
## 3  4.00 y3

An inner join returns the rows that have a key in both tables, with the combined columns from both tables.

x %>%
  inner_join(y, by = c("key" = "key"))
## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2

A left join returns the rows that have a key in the left table, with the combined columns from both tables, filling in NAs where there are no key matches in the right table.

x %>%
  left_join(y, by = c("key" = "key"))
## # A tibble: 3 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2   
## 3  3.00 x3    <NA>

A right join returns the rows that have a key in the right table, with the combined columns from both tables, filling in NAs where there are no key matches in the left table.

x %>%
  right_join(y, by = c("key" = "key"))
## # A tibble: 3 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2   
## 3  4.00 <NA>  y3

A full join returns the rows that have a key either table, with the combined columns from both tables, filling in NAs where there are no key matches in the corresponding table.

x %>%
  full_join(y, by = c("key" = "key"))
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2   
## 3  3.00 x3    <NA> 
## 4  4.00 <NA>  y3

Joins as set operations

Sometimes helpful but runs into problems when keys don't uniquely identify an observation.

Many to one

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
x
## # A tibble: 4 x 2
##     key val_x
##   <dbl> <chr>
## 1  1.00 x1   
## 2  2.00 x2   
## 3  2.00 x3   
## 4  1.00 x4
y
## # A tibble: 2 x 2
##     key val_y
##   <dbl> <chr>
## 1  1.00 y1   
## 2  2.00 y2

What type of join is this?

x %>%
  inner_join(y, key = c("key", "key"))
## Joining, by = "key"
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2   
## 3  2.00 x3    y2   
## 4  1.00 x4    y1
x %>%
  right_join(y, key = c("key", "key"))
## Joining, by = "key"
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  1.00 x4    y1   
## 3  2.00 x2    y2   
## 4  2.00 x3    y2

x %>%
  left_join(y, key = c("key", "key"))
## Joining, by = "key"
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2   
## 3  2.00 x3    y2   
## 4  1.00 x4    y1
x %>%
  full_join(y, key = c("key", "key"))
## Joining, by = "key"
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2   
## 3  2.00 x3    y2   
## 4  1.00 x4    y1

Activity 4

In the babynames package there are actually several tables. Start by renaming births, which has data from the US Census.

births
## # A tibble: 119 x 2
##     year  births
##    <int>   <int>
##  1  1909 2718000
##  2  1910 2777000
##  3  1911 2809000
##  4  1912 2840000
##  5  1913 2869000
##  6  1914 2966000
##  7  1915 2965000
##  8  1916 2964000
##  9  1917 2944000
## 10  1918 2948000
## # ... with 109 more rows
census_births <- births

The babynames data is from the social security administration. Let's subset (to make things interesting), then get it into the same form as census_births.

ssa_births <- babynames %>%
  filter(year < 2012) %>%
  group_by(year) %>%
  summarize(N = n(), births = sum(n))

Activity 4

  • 4.1: Create a new table using an inner_join(). What does this table represent? What time span is covered by this table?
  • 4.2: Create a new table using a left_join(). What does this table represent? What time span is covered by this table?
  • 4.3: Create a new table using a right_join(). What does this table represent? What time span is covered by this table?
  • 4.4: Which of the previous two tables has more missing values?
  • 4.5: Create a new table using a full_join(). What does this table represent? What time span is covered by this table?
  • 4.6: Using the most appropriate table, explore the fidelity between the SSA and Census counts. If they differ, why do you think this is?